Hey everyone! It’s been a while. I’ve had a pretty good reason for my absence — I’ve swapped waking up from my team’s nightly PagerDuty alarms for waking up from something far more delightful: my beautiful 3-month-old baby girl.
The comparison between a crying baby and a PagerDuty alert isn’t coincidental. In both cases, I go through the same emotional rollercoaster of denial, bargaining, and acceptance — all within a matter of minutes. With PagerDuty, it goes something like: “Maybe it’s a false alarm?”, “Maybe the error only happened once and will magically resolve if I click ‘acknowledge’?”, “Oh crap, I guess I really have to get up.” With the baby: “Did I hear that right?”, “Maybe if I wait a second she’ll settle back to sleep?”, “Oh crap, I really have to get up.” (Literally... crap, in the crying-baby example)
Over the past year, I’ve worked hard to improve our on-call experience — making alerts more accurate so no one loses sleep over false alarms. Along the way, I’ve learned a lot about what’s truly worth monitoring. So here it is: what you should be tracking in your app to make sure you’re only waking up for the things that really matter.
What Does Good Monitoring Look Like?
If you’re waking up in the middle of the night, it better be for a damn good reason. You want to be alerted only when something serious happens — not for minor bugs or things that can wait until morning. However, you want to know as soon as possible when things go bad. Ideally, you’d be alerted before your users notice anything wrong.
The good news? That’s totally doable, if you focus on the right metrics and make sure every alert includes enough context so anyone on-call can quickly take action — even when half-asleep.
The bad news? Setting up the right alerts is not a one-and-done task. It’s an ongoing effort. You’ll need to keep adjusting thresholds, adding and removing alerts, and continuously refining your monitoring setup — just like maintaining your codebase or sharing knowledge across the team.
Key Metrics to Monitor
There are tons of metrics you could monitor — but let’s focus on the ones that really move the needle. We’ll start with low-level system metrics and work our way up to high-level user behavior.

System Metrics (CPU, Disk Usage, etc.)
Before diving into application-specific monitoring, you need to ensure the underlying system is healthy.
Metrics like high CPU usage, memory pressure, or low disk space can lead to unexpected behavior, crashes, or even data loss.
Set up alerts for these so you can scale up resources or apply hotfixes before users feel the impact. If auto-scaling is an option — and it fits within your cost constraints — even better (think scaling on AWS SQS queue size, auto-expanding DB storage, etc.). That way, you’re only alerted when manual intervention is really needed.
These metrics are usually collected via a monitoring agent on the machine, like node_exporter with Prometheus.
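As a sketch, a Prometheus alerting rule on node_exporter’s filesystem metrics might look like this — the 10% threshold, the 15-minute window, and the labels are assumptions you’d adapt to your own setup:

```yaml
groups:
  - name: system-health
    rules:
      - alert: DiskSpaceLow
        # Fires when less than 10% of the root filesystem has been free
        # for 15 minutes straight, so short spikes don't page anyone.
        expr: |
          node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
```

The `for:` clause is what keeps a rule like this from waking you up over a transient blip.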
Service health (AKA “Liveness”)
This one’s non-negotiable. You need to know whether your service is actually up.
This is typically done by exposing a /live or /health endpoint and polling it regularly. Tools like UptimeRobot, BetterUptime, and Site24x7 can handle the health monitoring for you.
If your services are running on Kubernetes, you probably know the liveness configuration. With a livenessProbe, Kubernetes periodically checks whether your service is up (by sending an HTTP request, checking if a port is open, or executing a command) and restarts it if the probe fails a few times in a row — making it proactive.
livenessProbe:
  httpGet:
    path: /health
    port: 8080
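In practice, you’ll usually also tune the probe’s timing. The values below are illustrative starting points, not recommendations:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10  # give the app time to boot before probing
  periodSeconds: 15        # probe every 15 seconds
  failureThreshold: 3      # restart only after 3 consecutive failures
```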
Application Performance (APM)
Application Performance Monitoring is a set of tools that help you track the performance and availability of your applications.
APM is a big deal — and for a really good reason: it helps you understand how your app is performing in real time. Here are the key APM metrics you should alert on:
- Apdex (Application Performance Index):
A metric that measures user satisfaction with your app’s performance. It’s a score from 0 to 1, where 0 is total frustration and 1 is total satisfaction — user-friendly and easy to understand. The computation is based on a target response-time threshold (T): requests answered within T count as satisfied, requests within 4T as tolerating, and anything slower as frustrated (this is a slightly simplified explanation). Most APM tools (Datadog, New Relic, Grafana) support this out of the box. Define a reasonable threshold — and alert when the score drops.
- High Error Rate (usually 5xx):
Spikes in error rates often signal a problem. Most monitoring tools support anomaly detection for these. In Prometheus, for example, you can use increase(..{status=~"5.."}[5m]) to identify trends (note the =~ regex matcher). 4xx errors are less commonly monitored, but may be useful for spotting abnormal client behavior.
- Key Request / Key Transaction:
Some endpoints are business-critical — like /login or /payment. Many APMs let you monitor these separately, with their own thresholds and alerts, so they never slip through the cracks.
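To make the Apdex idea concrete, here’s a minimal sketch in Python, assuming a fixed threshold T and the standard tolerating band of T to 4T:

```python
def apdex(response_times, t):
    """Compute an Apdex score for a list of response times (in seconds),
    given a target threshold t: <= t is satisfied, <= 4t is tolerating,
    anything slower is frustrated."""
    if not response_times:
        return 1.0  # no traffic, nothing to be frustrated about
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)

# With a 500ms target: two satisfied, one tolerating, one frustrated.
score = apdex([0.1, 0.3, 0.9, 5.0], t=0.5)  # (2 + 0.5) / 4 = 0.625
```

Alerting when the score dips below a bar you choose (say, 0.9) is a reasonable starting point — your APM tool does this same computation for you.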
User Behavior (Synthetics Canaries / Production tests)
Synthetic (or active) monitoring simulates full user flows in production by running “canaries” — automated scripts that mimic user actions: calling your API, validating the responses, and so on. You can think of synthetics as running a bit of your CI suite continuously, in production.
Use it for high-value flows only, since it's more resource-intensive. Synthetic checks can catch issues before users even notice — making them a powerful proactive tool.
Synthetic monitoring is available in AWS (CloudWatch), Datadog, New Relic, and more.
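As a sketch, a canary can be as simple as a script that walks one critical flow and reports failures for a scheduler to alert on. The BASE_URL and endpoints below are hypothetical placeholders:

```python
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical

def default_fetch(url):
    """Fetch a URL, returning (status_code, body)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.status, resp.read().decode()

def run_canary(fetch=default_fetch):
    """Walk a critical flow; return a list of human-readable failures.
    An empty list means the flow is healthy."""
    failures = []
    for path in ("/health", "/login"):
        status, _body = fetch(BASE_URL + path)
        if status != 200:
            failures.append(f"{path} returned {status}")
    return failures
```

Injecting `fetch` keeps the canary testable; in production, a scheduler (cron, a CloudWatch canary, etc.) runs it every few minutes and pages when the failure list comes back non-empty.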
Bonus: Frontend / Web Monitoring
Frontend monitoring is its own beast. Why? Because the frontend runs on the user's device — not on your infrastructure. That means you have to explicitly send logs back to your servers.
Thankfully, most APM tools offer client-side plugins to automatically capture:
- JavaScript errors
- Network errors
- Web performance stats
- And more
Still, frontend issues can be tricky to detect because they often boil down to user experience. (“I clicked the button and nothing happened.”)
Here’s what you should track:
Web Vitals
Web Vitals (by Google) aim to quantify user experience — things like UI responsiveness and visual stability.
Some key metrics to know:
- LCP (Largest Contentful Paint) — how quickly the main content loads
- CLS (Cumulative Layout Shift) — how much the layout jumps around
Most frontend monitoring tools support Web Vitals, or at least a subset. More info here: web.dev/vitals
Error Rate
Track trends in JavaScript errors — things like TypeError, NetworkError, etc. This can help you catch bugs before users report them.
Heads up: You might get duplicate alerts here if a client-side error reflects an existing backend issue.
Browser Canaries
Some monitoring tools, like AWS CloudWatch, let you define a canary with access to a headless Google Chrome browser. This way, you can simulate real user actions in the browser (think Selenium or Puppeteer).
They’re great for validating key journeys (like login or checkout) end to end — and catching issues before users hit them.
Summary
We covered the most important metrics to monitor in your app so that your alerts are meaningful, actionable, and worth waking up for.
It’s not all or nothing — even starting with a few of these will improve your setup dramatically.
Just remember: monitoring is never done. Keep tweaking those thresholds, refining alerts, and removing the noisy ones. Your future self (and your sleep schedule) will thank you — and so will your users.
Sleep tight,
Omer